Some initial imports:
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
In [2]:
data = pd.read_csv('../data/data_with_problems.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)
Out[2]:
Time to deal with the issues previously found.
Drop the duplicated rows (rows in which all column values are the same); see the YPUQAPSOYJ
row above. Let us use drop_duplicates
to help us with that, keeping only the first occurrence of each duplicated row.
In [3]:
mask_duplicated = data.duplicated(keep='first')
mask_duplicated.head(10)
Out[3]:
In [4]:
data = data.drop_duplicates(keep='first')
print('Our dataset now has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(10)
Out[4]:
You could also consider as duplicates rows that share the same values only in a subset of columns, e.g., the same age, by using data.drop_duplicates(subset=['age'], keep='first')
; in our case it would lead to the same result. Note that, in general, using the argument inplace=True
(e.g., data.drop_duplicates(subset=['age'], keep='first', inplace=True)
) is not a recommended programming practice, as it may lead to unexpected results.
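As a minimal sketch of subset-based deduplication, using a small toy frame (not the notebook's dataset):

```python
import pandas as pd

# Toy data: two rows share the same age
df = pd.DataFrame({'age': [25, 25, 40],
                   'height': [170, 180, 165]})

# Treat rows with the same 'age' as duplicates, keeping the first one
deduped = df.drop_duplicates(subset=['age'], keep='first')
print(deduped.shape[0])  # 2 rows remain
```

Note that the second row is dropped even though its height differs, since only the columns listed in subset are compared.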
Missing values are one of the major, if not the biggest, data problems we face. There are several ways to deal with them, e.g.:
In [5]:
missing_data = data.isnull()
print('Number of missing values (NaN) per column/feature:')
print(missing_data.sum())
print('And we currently have %d rows.' % data.shape[0])
The amount of missing values is not so large that we need to fully drop a column/feature. Nevertheless, if we wanted to do that, the call would be data.drop('age', axis=1)
. The missing_data
variable is our mask for the missing values:
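A minimal sketch of dropping a whole column, again on toy data standing in for the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names matter here
df = pd.DataFrame({'age': [25, np.nan, 40],
                   'height': [170, 180, 165]})

# axis=1 tells drop() to remove a column rather than a row
without_age = df.drop('age', axis=1)
print(list(without_age.columns))  # ['height']
```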
In [6]:
missing_data.head(8)
Out[6]:
Dropping every row that contains any missing value can be done with dropna()
, for instance:
In [7]:
data_aux = data.dropna(how='any')
print('Dataset now with %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
Replacing all missing values with a constant (here, 0) can be done with fillna()
, for instance:
In [8]:
data_aux = data.fillna(value=0)
print('Dataset has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
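fillna() also accepts a dict, so each column can get its own fill value instead of a single constant for everything. A sketch on toy data (hypothetical values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan],
                   'gender': ['male', np.nan]})

# One fill value per column: the mean for 'age', a label for 'gender'
filled = df.fillna(value={'age': df['age'].mean(),
                          'gender': 'unknown'})
print(filled)
```

This avoids the problem shown below, where filling a categorical column with 0 invents a meaningless category.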
So, what happened with our dataset? Let's take a look where we had missing values before:
In [9]:
data_aux[missing_data['age']]
Out[9]:
In [10]:
data_aux[missing_data['height']]
Out[10]:
In [11]:
data_aux[missing_data['gender']]
Out[11]:
Looks like what we did was not the most appropriate. For instance, we created a new category (the fill value 0) in the gender
column:
In [12]:
data_aux['gender'].value_counts()
Out[12]:
In [13]:
data['height'] = data['height'].replace(np.nan, data['height'].mean())
data[missing_data['height']]
Out[13]:
In [14]:
data.loc[missing_data['age'], 'age'] = data['age'].median()
data[missing_data['age']]
Out[14]:
In [15]:
data['gender'].value_counts(dropna=False)
Out[15]:
Let's replace MALE
with male
to harmonize the feature.
In [16]:
mask = data['gender'] == 'MALE'
data.loc[mask, 'gender'] = 'male'
# validate we don't have MALE:
data['gender'].value_counts(dropna=False)
Out[16]:
Now we don't have the MALE
entry anymore. Let us fill the missing values with the mode:
In [17]:
the_mode = data['gender'].mode()
# note that mode() returns a Series (there can be ties)
the_mode
Out[17]:
In [18]:
data['gender'] = data['gender'].replace(np.nan, data['gender'].mode()[0])
data[missing_data['gender']]
Out[18]:
In [19]:
data.isnull().sum()
Out[19]: